Get data

In order to get the data for this tutorial, you’ll need to get an API key to use the Census API service here. Once you have it, assign the key to the variable api_key.

api_key <- "put api key here"

Now that you have the API key, run the following code that is available on the Storytelling repository on Github. This will automatically pull in code that and assemble the dataset for the visualizations in this section.

source("https://raw.githubusercontent.com/CommerceDataService/cda_storytelling_in_r/gh-pages/get_data.R")

What’s in the data?

Sometimes two dimensional visuals are not enough. There is a lot more to the data that can be used to contextualize latent patterns. Often times, many analysts tend to think in two-dimensions – like scatter plots. But there’s more to it. In the dataset that you’ve just imported, it has the following characteristics:

summary(data)
##     GEOID               state        geography             region     
##  Length:3142        Min.   : 1.00   Length:3142        Min.   :1.000  
##  Class :character   1st Qu.:18.00   Class :character   1st Qu.:2.000  
##  Mode  :character   Median :29.00   Mode  :character   Median :3.000  
##                     Mean   :30.28                      Mean   :2.669  
##                     3rd Qu.:45.00                      3rd Qu.:3.000  
##                     Max.   :56.00                      Max.   :4.000  
##                                                                       
##                name       pct_poverty      pct_unemp      pct_hs_grad   
##  South Region    :1422   Min.   : 0.00   Min.   : 0.00   Min.   :46.70  
##  Midwest Region  :1055   1st Qu.: 8.30   1st Qu.: 3.70   1st Qu.:80.70  
##  West Region     : 448   Median :11.50   Median : 4.90   Median :86.40  
##  Northeast Region: 217   Mean   :12.35   Mean   : 4.94   Mean   :84.98  
##  Alabama         :   0   3rd Qu.:15.30   3rd Qu.: 6.10   3rd Qu.:90.10  
##  Alaska          :   0   Max.   :45.50   Max.   :20.90   Max.   :98.70  
##  (Other)         :   0

Let’s say we were provided a nice clean set of data that contains the following:

What can you do with that data? Well, turns out that that these quantities are related.

How did we get to this?

#Load in Threejs library      
      library(threejs)

We can see that there are direct relationships between unemployment, poverty and education attainment. But there isn’t much detail and the graphs aren’t pretty.

        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad)

Let’s stylize the plots. First let’s name the axes with axisLabels, which accepts a vector of axis names. The order matters and is as follows: x-axis, z-axis, y-axis

      #Note that axis Labels should follow this order= c(x, z, y)
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,   
                     axisLabels=c("unemployment","hs degree or above","poverty rate"))    

Now let’s change the rendering engine to give more depth to the plot. We do so by changing renderer = “canvas”. This just tells R threejs to use a different package to render the points

      #Depth using render
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad, 
                       axisLabels=c("unemployment","hs degree or above","poverty rate"),
                       renderer="canvas")   

Now, let’s set the color of the points, resize the points, and flip the y axis so it’s ascending from the origin. To do so, we: - set col = “slategrey” - set flip.y = FALSE - set size = 0.5

      #Point size, color, don't flip y axis
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad, 
                       axisLabels=c("unemployment","hs degree or above","poverty rate"),
                       renderer="canvas",  flip.y=FALSE, col="slategrey",
                       size=0.5)   

Ultimately, we want to find more patterns. By using color, we can group regions by color. We can see some regions are worse off than others. But which? Turns out there are 4 regions:

   unique(data$region_name)     
## NULL
    unique(data$region)  
## [1] 1 2 3 4

First, let’s set each region to a different color by first creating a new variable for colors data$colors, then assign a hexcode to each region.

      #Set up colors by 
        data$colors <- ""
        data$colors[data$region==1] <- "#011efe0"
        data$colors[data$region==2] <- "#0bff01"
        data$colors[data$region==3] <- "#fe00f6"
        data$colors[data$region==4] <- "#fdfe02"

Now, let’s set col= data$colors so that R knows which color corresponds to each of the 3000 points.

      data <- data[order(data$region),]
      #Grouped patterns
        scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad, 
                       axisLabels=c("unemployment","hs degree or above","poverty rate"),
                       col=data$colors,  flip.y=FALSE, 
                       renderer="canvas", 
                       size=0.5)   

It’s a bit annoying to look at the chart without knowing which point corresponds to which county. Let’s add labels for each point that show up upon mousing over.

      #add labels to points
      scatterplot3js(data$pct_unemp, data$pct_poverty,data$pct_hs_grad,   
                     axisLabels=c("unemployment","hs degree or above","poverty rate"),
                     col=data$colors,
                     labels=paste(data$region_name,": ",data$geography), 
                     size=0.5,
                    renderer="canvas")

In short, we can tell the following key insights from this graph.

Part 2: Maps

Sometimes graphs don’t get the point across. Maps, while over used, can provide some better indication of patterns.

Based on our 3-d graphs, we could see clustering of regions’s economic performance. We can see the mess of points more clearly on a map.

## OGR data source with driver: ESRI Shapefile 
## Source: "cb_2014_us_county_20m.shp", layer: "cb_2014_us_county_20m"
## with 3220 features
## It has 9 fields

Getting started

We can use the leaflet library to bring a geographic spin to the data. To initiate a map, we only need to open the leaflet library, then run the following:

      library(leaflet)
      leaflet()  

You’ll see that the map is blank with a zoom control panel on the upper left. That’s because the map doesn’t have data in it. There are dozens on free layers we can use:

  • Stamen.Toner
  • CartoDB.Positron
  • HikeBike.HikeBike
  • CartoDB.DarkMatter
        leaflet() %>%
        addProviderTiles("Stamen.Toner") 
        leaflet() %>%
        addProviderTiles("CartoDB.Positron") 

Now let’s center and zoom in on the contiguous US at coordinates lon = -98.3 and lat = 39.5

        leaflet() %>%
        addProviderTiles("CartoDB.Positron") %>%
        setView(lng = -98.3, lat = 39.5, zoom = 4) 

We now need to download the shapefiles, which are a popular geospatial vector data format for geographic information system (GIS) software. Shapefiles allow for rendering of various types of data, including points (e.g. coordinates), polygons (e.g. county boundaries), and lines (e.g. streets, creeks). We’re going to use the US County Shapefile from the US Census: http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip. To download it and load it into R, we’ll need to first install the rgdal library. We’ve also written a function shape_direct that can be run to import the shapefile and assign it to an object shp.

      shape_direct <- function(url, shp) {
        library(rgdal)
        temp = tempfile()
        download.file(url, temp) ##download the URL taret to the temp file
        unzip(temp,exdir=getwd()) ##unzip that file
        return(readOGR(paste(shp,".shp",sep=""),shp))
      }
      
      shp <- shape_direct(url="http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_county_20m.zip",
                          shp= "cb_2014_us_county_20m")
## OGR data source with driver: ESRI Shapefile 
## Source: "cb_2014_us_county_20m.shp", layer: "cb_2014_us_county_20m"
## with 3220 features
## It has 9 fields
## Warning in readOGR(paste(shp, ".shp", sep = ""), shp): Z-dimension
## discarded

With the new shapefile now imported, we can now set data = shp.

        leaflet(data=shp) %>%
        addProviderTiles("CartoDB.Positron") %>%
        setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
        addPolygons(fillColor = "blue", 
                    fillOpacity = 0.8, 
                    color = "white", 
                    weight = 0.5)     

In order to draw insight from a map, we’ll need to color code county polygons. This is known as a choropleth map – each county is color coded with respect to a certain value of a given variable like %poverty. The shapefile on its own doesn’t have the socioeconomic data and we’ll need to join the data to the shapefile. Let’s just quickly check the data formats of the primary key GEOID.

   str(shp@data$GEOID)
##  Factor w/ 3220 levels "01001","01003",..: 560 741 2222 2660 2471 2200 2392 2707 770 853 ...
   str(data$GEOID)
##  chr [1:3142] "09001" "09003" "09005" "09007" "09009" ...

Since the shp primary key is in a factor format and the data primary key is in string or character format, we’ll need to conform the formats, preferrably to strings. Then, we can merge the two datasets.

      shp@data$GEOID <- as.character(shp@data$GEOID)
      shp <- merge(shp,data,id="GEOID")

With the merged datasets, we’ll now need to specify a color scheme. Using colorQuantile, we can create a create a function that will slice any continuous variable into bins and assign colors to each bin. The syntax is as follows:

      pal <- colorQuantile(<Color Code>, <variable>, n = <number of bins>)

For our example, we’ll use a Yellow-Green palette, leave as NULL so that we can re-use the palette function, and set the number of bins to 30.

      palette <- colorQuantile("YlGn", NULL, n = 30)

Next we will add popup text for when a user clicks on a county in a map.

      county_popup <- paste0("<strong>County: </strong>", 
                            shp@data$geography, 
                            "<br><strong>Poverty Rate (%): </strong>", 
                            shp@data$pct_poverty)

Now we’ll pull it all together.

      leaflet(data = shp) %>%
        addProviderTiles("CartoDB.Positron") %>%
        setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
        addPolygons(fillColor = ~palette(pct_poverty), 
                    fillOpacity = 0.8, 
                    color = "#BDBDC3", 
                    weight = 0.1, 
                    popup = county_popup)     

We can run the same graph for pct_unemp by swapping out pct_poverty

       county_popup <- paste0("<strong>County: </strong>", 
                            shp@data$geography, 
                            "<br><strong>Unemp (%): </strong>", 
                            shp@data$pct_unemp)
      
      leaflet(data = shp) %>%
        addProviderTiles("CartoDB.Positron") %>%
        setView(lng = -98.3, lat = 39.5, zoom = 4) %>%
        addPolygons(fillColor = ~palette(pct_unemp), 
                    fillOpacity = 0.8, 
                    color = "#BDBDC3", 
                    weight = 0.1, 
                    popup = county_popup)     

```